Sampling

Elizabeth King
Kevin Middleton

Sampling to avoid complicated math

\[\textrm{posterior} = \frac{\textrm{prior} \cdot \textrm{likelihood}}{\textrm{normalizing constant}}\]

Many ways to sample

  • Metropolis algorithm (1953)
  • Metropolis-Hastings algorithm (1970)
  • Gibbs sampling (1984)
  • Hamiltonian Monte Carlo (1987, 1996, 2014)

Sampling is not optimization

  • Sampling yields a set of samples
    • n chains \(\times\) n iterations
  • Samples represent the most plausible values in proportion to their relative compatibility with the data, model, and priors.

The posterior contains simultaneous samples for each parameter.

Basics of MCMC

  1. Current parameter values (have a model likelihood)
  2. Propose a new set of values by slightly perturbing the current values (these also have a model likelihood)
  3. Move to the proposed values with probability equal to the ratio of the two likelihoods (capped at 1)
  4. Repeat

“Bad” jumps are possible with non-zero probability

  • Not optimization or “hill-climbing”

Simulate data

  • Generate a small dataset
  • Easier to look at likelihoods
set.seed(467456)
(y <- rnorm(n = 8, mean = 50, sd = 2))
[1] 44.88692 48.25894 48.44770 50.16844 49.55441 49.64544 48.49286 48.91631
mean(y)
[1] 48.54638
sd(y)
[1] 1.624759

Choose starting values

library(truncnorm)  # provides rtruncnorm() for the truncated normal

set.seed(45738)
(mu <- rnorm(1, mean = 45))
[1] 46.01168
(sigma <- rtruncnorm(1, a = 0))
[1] 0.5373229

Calculate the probability for \(y_i\)

Use dnorm() with log = TRUE to get the log of the normal density for two observations of \(y\), given the current values of \(\mu\) and \(\sigma\):

dnorm(y[1], mean = mu, sd = sigma, log = TRUE)
[1] -2.488659
dnorm(y[2], mean = mu, sd = sigma, log = TRUE)
[1] -9.043736

Model likelihood is the product of the probabilities for all the observations

  • Use a log scale
  • Add instead of multiply
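Why logs? With these starting values the raw likelihood is on the order of \(10^{-54}\); multiplying many small densities quickly underflows, while summing their logs stays numerically stable. A standalone sketch (restating y, mu, and sigma from the slides above):

```r
set.seed(467456)
y <- rnorm(n = 8, mean = 50, sd = 2)
mu <- 46.01168
sigma <- 0.5373229

# Product of raw densities: a vanishingly small number (~1e-54 here);
# with many more observations a product like this underflows to 0
prod(dnorm(y, mean = mu, sd = sigma))

# Sum of log densities: the same information, numerically stable
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
```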

Calculate model log-likelihood

Add up the logged probabilities for each observation (log = TRUE):

(LL_current <- sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)))
[1] -123.3949

Now what?

  • We need something to compare this likelihood to: a proposal

Proposal

  • Add a small random amount to each value
  • Calculate the model likelihood
set.seed(52224241)
(mu_prop <- mu + rnorm(1, 0, 0.01))
[1] 46.01393
(sigma_prop <- sigma + rnorm(1, 0, 0.001))
[1] 0.5366165
if (sigma_prop < 0) sigma_prop <- abs(sigma_prop)
(LL_prop <- sum(dnorm(y, mean = mu_prop, sd = sigma_prop, log = TRUE)))
[1] -123.5445

Evaluate the proposal

Take the ratio of the likelihoods by subtracting the log-likelihoods and exponentiating:

exp(LL_prop - LL_current)
[1] 0.8610612
  • If the ratio is > 1
    • The proposal is better
    • Make the move
  • If the ratio is < 1
    • Draw a random number between 0 and 1: runif(1)
    • If the ratio is > runif(1), make the move
    • Otherwise keep the current values and move to the next iteration
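Both branches reduce to a single comparison, since a ratio > 1 beats any runif(1) draw automatically. A minimal helper (the function name accept_move is mine):

```r
# Metropolis acceptance rule: accept a better proposal always,
# accept a worse one with probability equal to the likelihood ratio
accept_move <- function(LL_prop, LL_current) {
  exp(LL_prop - LL_current) > runif(1)
}
```

With the values above (LL_prop = -123.5445, LL_current = -123.3949), the ratio is about 0.86, so the move is made only when the runif(1) draw falls below 0.86.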

Evaluating the proposal

set.seed(34297)
runif(1)
[1] 0.8803688
exp(LL_prop - LL_current)
[1] 0.8610612
  • Do not make the move.
    • Save the current values.
  • Move to the next iteration.

Metropolis Sampling Setup

set.seed(435739)
n_samp <- 10000

s <- matrix(NA, ncol = 2, nrow = n_samp)
s[1, ] <- c(1.011675, 0.5373229)

n_accepted <- 0
n_rejected <- 0

Run the sampler

for (ii in 2:n_samp) {
  LL_current <- sum(dnorm(y, mean = s[ii - 1, 1], sd = s[ii - 1, 2],
                          log = TRUE))

  mu_prop <- s[ii - 1, 1] + rnorm(1, 0, 1)
  sigma_prop <- s[ii - 1, 2] + rnorm(1, 0, 1)
  if (sigma_prop < 0) sigma_prop <- abs(sigma_prop)
  
  LL_prop <- sum(dnorm(y, mean = mu_prop, sd = sigma_prop, log = TRUE))

  if (exp(LL_prop - LL_current) > runif(1)) {
    s[ii, ] <- c(mu_prop, sigma_prop)
    mu <- mu_prop
    sigma <- sigma_prop
    n_accepted <- n_accepted + 1
  } else {
    s[ii, ] <- s[ii - 1, ]
    n_rejected <- n_rejected + 1
  }
}

Examine the output

head(s, n = 10)
            [,1]      [,2]
 [1,]  1.0116750 0.5373229
 [2,] -0.9489727 0.9602566
 [3,] -0.9489727 0.9602566
 [4,] -0.9489727 0.9602566
 [5,] -0.9489727 0.9602566
 [6,] -0.9489727 0.9602566
 [7,]  0.7625157 2.2189009
 [8,]  0.7625157 2.2189009
 [9,]  1.7529252 2.9303900
[10,]  0.2081370 3.3238409
n_accepted
[1] 3822
n_rejected
[1] 6177
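About 38% of proposals were accepted, which sits in the range often recommended (roughly 20-45%) for random-walk Metropolis. A quick check using the counts above:

```r
n_accepted <- 3822
n_rejected <- 6177

# Acceptance rate (row 1 of s is the starting value, not a proposal)
n_accepted / (n_accepted + n_rejected)
```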

Plotting the chain

Examining the chain


Problems

  • Chains can take a while to begin oscillating around the region of highest posterior density
    • Discard samples for burn-in
  • Autocorrelation
    • Save every nth sample (thinning)

Chain diagnostics

            [,1]      [,2]
Lag 0  1.0000000 1.0000000
Lag 1  0.9955104 0.9978622
Lag 5  0.9800271 0.9919311
Lag 10 0.9647656 0.9866262
Lag 50 0.8500191 0.9466300
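A table like this is what coda's autocorr.diag() reports; its default lags are exactly 0, 1, 5, 10, and 50. A self-contained sketch using a toy AR(1) chain in place of the sampler output:

```r
library(coda)

set.seed(1)
# A strongly autocorrelated toy chain stands in for the raw samples
x <- as.numeric(arima.sim(list(ar = 0.99), n = 10000))

# Autocorrelation at lags 0, 1, 5, 10, 50 (the defaults)
autocorr.diag(as.mcmc(x))
```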

Naive improvements

  • Take a lot more samples (\(10^6\))
  • Discard the initial 25% of the samples
  • Keep every 100th sample
s <- s[(n_samp * 0.25):n_samp, ]         # Drop first 25% as burn-in
s_thin <- s[seq(1, nrow(s), by = 100), ] # Thin every 100th sample
s_mcmc <- as.mcmc(s_thin)
autocorr.diag(s_mcmc)
               [,1]          [,2]
Lag 0   1.000000000  1.0000000000
Lag 1  -0.000697156  0.0009217929
Lag 5  -0.008787172 -0.0045931536
Lag 10  0.019425561  0.0142671041
Lag 50  0.005370590 -0.0066584641

Plotting

Zooming

Samples to a distribution

Summarizing
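Post here is presumably the thinned samples reshaped into long format, one row per draw per parameter. A sketch of how it could be built (assuming the tibble and tidyr packages, with toy values standing in for s_thin):

```r
library(tibble)
library(tidyr)

# Toy thinned samples stand in for s_thin from the sampler
set.seed(1)
s_thin <- matrix(rnorm(20), ncol = 2)
colnames(s_thin) <- c("mu", "sigma")

Post <- as_tibble(s_thin) |>
  pivot_longer(cols = everything(),
               names_to = "Parameter",
               values_to = "Estimate")
```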

Post |> 
  group_by(Parameter) |> 
  summarize(Median = median(Estimate),
            Q2.5 = quantile(Estimate, 0.025),
            Q97.5 = quantile(Estimate, 0.975))
# A tibble: 2 × 4
  Parameter Median  Q2.5 Q97.5
  <chr>      <dbl> <dbl> <dbl>
1 mu         48.6  47.0  50.1 
2 sigma       1.85  1.13  3.79
median(y)
[1] 48.70458
sd(y)
[1] 1.624759

Bayesian inference by Monte Carlo sampling

  • More complicated than simple MCMC
    • Include the prior
  • Challenging when there are many predictors
    • High proportion of rejected moves
  • Gibbs sampling improves performance
    • Sample one parameter at a time within each step
    • Still can have high autocorrelation of parameter estimates
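The one-parameter-at-a-time idea can be sketched as Metropolis-within-Gibbs: perturb and accept \(\mu\) with \(\sigma\) held fixed, then \(\sigma\) with \(\mu\) held fixed. A toy version for the same normal model (true Gibbs sampling instead draws each parameter from its full conditional distribution):

```r
set.seed(1)
y <- rnorm(8, mean = 50, sd = 2)

log_lik <- function(mu, sigma) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

n_samp <- 5000
s <- matrix(NA, ncol = 2, nrow = n_samp)
s[1, ] <- c(45, 1)

for (ii in 2:n_samp) {
  mu <- s[ii - 1, 1]
  sigma <- s[ii - 1, 2]

  # Update mu with sigma held fixed
  mu_prop <- mu + rnorm(1, 0, 0.5)
  if (exp(log_lik(mu_prop, sigma) - log_lik(mu, sigma)) > runif(1)) {
    mu <- mu_prop
  }

  # Update sigma with mu held fixed (reflect to keep it positive)
  sigma_prop <- abs(sigma + rnorm(1, 0, 0.5))
  if (exp(log_lik(mu, sigma_prop) - log_lik(mu, sigma)) > runif(1)) {
    sigma <- sigma_prop
  }

  s[ii, ] <- c(mu, sigma)
}

# Posterior means should land near mean(y) and sd(y)
colMeans(s[1000:n_samp, ])
```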

Hamiltonian Monte Carlo

Don’t write your own MCMC sampler

Leave it to the professionals.

References

Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. Stan: A Probabilistic Programming Language. J. Stat. Softw. 76:1–32.
Hoffman, M. D., and A. Gelman. 2014. The No-U-turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15:1593–1623.